Beta-Binomial (betabinom) Distribution#

The beta-binomial distribution models a count of successes in a fixed number of trials when the success probability is unknown and varies across repeated experiments.

A clean way to say it:

Draw a random probability \(p \sim \mathrm{Beta}(\alpha, \beta)\), then draw successes \(X\mid p \sim \mathrm{Binomial}(n, p)\).
The marginal distribution of \(X\) is beta-binomial.

It is a standard choice when binomial data are overdispersed (more variable than a binomial model allows).


Learning goals#

  • Understand the hierarchical story behind betabinom and when to use it

  • Write the PMF/CDF, compute moments, and interpret parameters \((n,\alpha,\beta)\)

  • Derive mean/variance from first principles

  • Sample and visualize the distribution (NumPy-only sampler)

  • Use scipy.stats.betabinom for PMF/CDF/RVS and implement a practical MLE “fit” routine

import os

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

import scipy
from scipy import optimize, special, stats

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=6, suppress=True)
rng = np.random.default_rng(42)

print("numpy :", np.__version__)
print("pandas:", pd.__version__)
print("scipy :", scipy.__version__)
print("plotly:", plotly.__version__)
numpy : 1.26.2
pandas: 2.1.3
scipy : 1.15.0
plotly: 6.5.2

Prerequisites#

  • Binomial distribution basics: counts of successes in \(n\) trials

  • Beta distribution basics: a distribution on probabilities \(p \in (0,1)\)

  • Law of total expectation/variance

  • Comfort reading math in LaTeX

1) Title & classification#

  • Name: betabinom (beta-binomial)

  • Type: Discrete distribution

  • Support: \( \mathcal{X} = \{0, 1, 2, \dots, n\} \)

  • Parameter space: \( n \in \{0,1,2,\dots\},\qquad \alpha > 0,\qquad \beta > 0 \)

SciPy uses the names (n, a, b) for \((n, \alpha, \beta)\).

2) Intuition & motivation#

What it models#

A binomial model assumes a single fixed success probability \(p\).

In many real settings, \(p\) is not constant:

  • users vary in propensity to click

  • batches vary in defect rate

  • clinics vary in response rate

A simple way to model this heterogeneity is to treat \(p\) as random:

\[ p \sim \mathrm{Beta}(\alpha, \beta), \qquad X \mid p \sim \mathrm{Binomial}(n, p) \]

The marginal distribution of \(X\) is beta-binomial.

Real-world use cases#

  • Overdispersed binomial counts: when sample variance exceeds \(n\hat p(1-\hat p)\)

  • Empirical Bayes for rates: estimate a prior over \(p\) across many groups

  • Posterior predictive for binomial + beta prior (conjugate Bayesian modeling)

Relations to other distributions#

  • Mixture: beta-binomial is a Beta–Binomial compound distribution (a mixture of binomials)

  • Conjugacy: Beta is conjugate prior for Binomial; beta-binomial is the predictive distribution

  • Pólya’s urn: beta-binomial counts appear as draws in an urn scheme with reinforcement (a simulation check follows this list)

  • Dirichlet-multinomial: multivariate generalization (counts across multiple categories)
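The urn connection is easy to check by simulation. Below is a minimal sketch (the starting weights and \(n\) are illustrative): each draw reinforces its own color with one extra ball, and the distribution of “red” draw counts across many runs should match the beta-binomial PMF.

# Pólya's urn: draws with unit reinforcement; red-draw counts are BetaBinomial(n, a, b)
def polya_urn_red_count(n: int, a: float, b: float, rng: np.random.Generator) -> int:
    a_cur, b_cur = a, b  # current urn weights for red / blue
    reds = 0
    for _ in range(n):
        if rng.random() < a_cur / (a_cur + b_cur):
            reds += 1
            a_cur += 1.0  # reinforce red
        else:
            b_cur += 1.0  # reinforce blue
    return reds

n_urn, a_urn, b_urn = 8, 2.0, 3.0
draws = np.array([polya_urn_red_count(n_urn, a_urn, b_urn, rng) for _ in range(20_000)])
emp_pmf = np.bincount(draws, minlength=n_urn + 1) / len(draws)
emp_pmf, stats.betabinom.pmf(np.arange(n_urn + 1), n_urn, a_urn, b_urn)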

3) Formal definition#

Hierarchical definition#

\[ p \sim \mathrm{Beta}(\alpha, \beta), \qquad X \mid p \sim \mathrm{Binomial}(n, p) \]

PMF#

For \(k \in \{0,1,\dots,n\}\):

\[ \Pr[X = k] = \binom{n}{k}\,\frac{B(k+\alpha,\; n-k+\beta)}{B(\alpha,\beta)} \]

where \(B(\cdot,\cdot)\) is the beta function:

\[ B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}. \]

A numerically stable computation uses the log form:

\[ \log \Pr[X=k] = \log\binom{n}{k} + \log B(k+\alpha, n-k+\beta) - \log B(\alpha,\beta). \]

CDF#

The CDF is the cumulative sum of the PMF:

\[ F(k) = \Pr[X \le k] = \sum_{i=0}^{\lfloor k \rfloor} \Pr[X=i]. \]

There is no simple elementary closed form; libraries compute it efficiently and stably.

def betabinom_logpmf(k: np.ndarray, n: int, a: float, b: float) -> np.ndarray:
    """Log PMF using stable special functions (vectorized).

    Parameters
    ----------
    k : array-like
        Counts in {0, ..., n}.
    n : int
        Number of trials.
    a, b : float
        Beta shape parameters (alpha, beta) > 0.
    """
    k = np.asarray(k)
    log_choose = (
        special.gammaln(n + 1)
        - special.gammaln(k + 1)
        - special.gammaln(n - k + 1)
    )
    return log_choose + special.betaln(k + a, n - k + b) - special.betaln(a, b)


def betabinom_pmf(k: np.ndarray, n: int, a: float, b: float) -> np.ndarray:
    return np.exp(betabinom_logpmf(k, n=n, a=a, b=b))


def betabinom_cdf(k: np.ndarray, n: int, a: float, b: float) -> np.ndarray:
    """CDF computed by cumulative sum of the PMF on the support."""
    ks = np.arange(0, n + 1)
    pmf = betabinom_pmf(ks, n=n, a=a, b=b)
    cdf_full = np.cumsum(pmf)
    k_floor = np.floor(np.asarray(k)).astype(int)
    # Below the support the CDF is 0; above it, the clip pins the value at 1.
    out = cdf_full[np.clip(k_floor, 0, n)]
    return np.where(k_floor < 0, 0.0, out)


def betabinom_entropy(n: int, a: float, b: float) -> float:
    """Shannon entropy in nats: H = -sum p log p."""
    ks = np.arange(0, n + 1)
    logp = betabinom_logpmf(ks, n=n, a=a, b=b)
    p = np.exp(logp)
    return float(-np.sum(p * logp))


def betabinom_stats_closed_form(n: int, a: float, b: float):
    """Mean, variance, skewness, excess kurtosis (Fisher)."""
    s = a + b
    p = a / s
    q = b / s
    mean = n * p
    var = n * p * q * (n + s) / (s + 1)

    # For n=0 the distribution is degenerate at 0; higher standardized moments are undefined.
    if n == 0 or var == 0:
        return float(mean), float(var), np.nan, np.nan

    skew = (s + 2 * n) * (b - a) * np.sqrt(s + 1) / ((s + 2) * np.sqrt(n * a * b * (s + n)))

    t = a * b
    kurt_excess = (
        (s + 1)
        * (
            s**4
            + (6 * n - 1) * s**3
            + (6 * n**2 + 3 * t * (n - 2)) * s**2
            - 3 * t * n * (6 - n) * s
            - 18 * t * n**2
        )
        / (n * t * (s + n) * (s + 2) * (s + 3))
        - 3
    )
    return float(mean), float(var), float(skew), float(kurt_excess)


def rvs_betabinom_numpy(n: int, a: float, b: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """NumPy-only sampler via Gamma->Beta + Binomial.

    - If G1 ~ Gamma(a, 1) and G2 ~ Gamma(b, 1), then P = G1 / (G1 + G2) ~ Beta(a, b).
    - Then sample X ~ Binomial(n, P).
    """
    g1 = rng.gamma(shape=a, scale=1.0, size=size)
    g2 = rng.gamma(shape=b, scale=1.0, size=size)
    p = g1 / (g1 + g2)
    return rng.binomial(n, p)

4) Moments & properties#

Let \(X \sim \mathrm{BetaBinomial}(n,\alpha,\beta)\) and define \(s = \alpha+\beta\).

Mean#

\[ \mathbb{E}[X] = n\,\frac{\alpha}{\alpha+\beta}. \]

Variance#

\[ \mathrm{Var}(X) = n\,\frac{\alpha\beta}{(\alpha+\beta)^2}\,\frac{\alpha+\beta+n}{\alpha+\beta+1}. \]

This is often written as:

\[ \mathrm{Var}(X) = n\,p(1-p)\,\frac{n+s}{s+1}, \qquad p=\frac{\alpha}{s}. \]

Compared to binomial variance \(n p(1-p)\), the factor \((n+s)/(s+1) \ge 1\) produces overdispersion.
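A quick numerical check of that factor (parameter values are illustrative):

# Var(BetaBinomial) / Var(Binomial with matched mean) should equal (n+s)/(s+1)
n_chk, a_chk, b_chk = 20, 2.0, 6.0
s_chk = a_chk + b_chk
p_chk = a_chk / s_chk
var_bb = float(stats.betabinom.var(n_chk, a_chk, b_chk))
var_bin = n_chk * p_chk * (1 - p_chk)
var_bb / var_bin, (n_chk + s_chk) / (s_chk + 1)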

Skewness#

\[ \gamma_1 = \frac{(s+2n)(\beta-\alpha)\sqrt{s+1}}{(s+2)\sqrt{n\alpha\beta(s+n)}}. \]

Excess kurtosis (Fisher)#

A symmetric closed form is (let \(t=\alpha\beta\)):

\[ \gamma_2 = \frac{(s+1)}{n\,t\,(s+n)(s+2)(s+3)}\Big[ s^4 + (6n-1)s^3 + (6n^2 + 3t(n-2))s^2 - 3tn(6-n)s - 18tn^2 \Big] - 3. \]

PGF / MGF / characteristic function#

The probability generating function (PGF) can be written using the Gauss hypergeometric function (we write the argument as \(z\) to avoid a clash with \(s=\alpha+\beta\)):

\[ G(z) = \mathbb{E}[z^X] = {}_2F_1\left(-n,\;\alpha;\;\alpha+\beta;\;1-z\right). \]

Then:

  • MGF: \(M(t)=\mathbb{E}[e^{tX}] = G(e^t)\)

  • CF: \(\varphi(t)=\mathbb{E}[e^{itX}] = G(e^{it})\)

Because \(X \in \{0,\dots,n\}\) is bounded, all moments exist.

Entropy#

There is no simple closed form in elementary functions; it is typically computed numerically:

\[ H(X) = -\sum_{k=0}^n \Pr[X=k]\,\log \Pr[X=k]. \]

Intraclass correlation (trial-to-trial dependence)#

In the hierarchical view, the Bernoulli trials are conditionally independent given \(p\), but marginally correlated.

The correlation between two distinct trials in the same group is:

\[ \rho = \frac{1}{\alpha+\beta+1} = \frac{1}{s+1}. \]

Binomial data correspond to the limiting case \(\rho \to 0\) (equivalently \(s \to \infty\)).
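This correlation is easy to verify by Monte Carlo: draw one \(p\) per group, then two Bernoulli trials from that shared \(p\), and compare their empirical correlation to \(1/(s+1)\). A minimal sketch with illustrative \((\alpha,\beta)\):

# Two Bernoulli trials sharing the same random p are marginally correlated
a_icc, b_icc = 2.0, 3.0
p_icc = rng.beta(a_icc, b_icc, size=200_000)
trial1 = rng.binomial(1, p_icc)
trial2 = rng.binomial(1, p_icc)
rho_emp = float(np.corrcoef(trial1, trial2)[0, 1])
rho_emp, 1.0 / (a_icc + b_icc + 1.0)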

# Quick numerical check: closed-form moments vs SciPy
n, a, b = 10, 2.0, 3.0
mean_cf, var_cf, skew_cf, kurt_cf = betabinom_stats_closed_form(n, a, b)
mean_sp, var_sp, skew_sp, kurt_sp = stats.betabinom.stats(n, a, b, moments="mvsk")

pd.DataFrame(
    {
        "closed_form": [mean_cf, var_cf, skew_cf, kurt_cf],
        "scipy": [float(mean_sp), float(var_sp), float(skew_sp), float(kurt_sp)],
    },
    index=["mean", "variance", "skewness", "excess_kurtosis"],
)
closed_form scipy
mean 4.000000 4.000000
variance 6.000000 6.000000
skewness 0.291606 0.291606
excess_kurtosis -0.690476 -0.690476
# Entropy (numerical) + PGF/MGF evaluation via hypergeometric function
H = betabinom_entropy(n=10, a=2.0, b=3.0)

def betabinom_pgf(z: complex, n: int, a: float, b: float) -> complex:
    return complex(special.hyp2f1(-n, a, a + b, 1 - z))

M0 = betabinom_pgf(1.0, n=10, a=2.0, b=3.0)  # should be 1
M_small = betabinom_pgf(np.exp(0.2), n=10, a=2.0, b=3.0)

H, M0, M_small
(2.2577232124962414, (1+0j), (2.518957094288793+0j))

5) Parameter interpretation#

\(n\) (trials)#

  • Controls the maximum number of successes.

  • Scaling \(n\) changes the range and (often) the granularity of the PMF.

\((\alpha,\beta)\) (Beta prior over \(p\))#

Interpretations:

  • Prior mean of the success probability: \( \mathbb{E}[p] = \frac{\alpha}{\alpha+\beta} \)

  • Prior “concentration” (how strongly \(p\) is concentrated around its mean): \( s = \alpha+\beta \). Larger \(s\) means **less heterogeneity** in \(p\) and a distribution closer to \(\mathrm{Binomial}(n,\mathbb{E}[p])\).

Heuristic pseudo-count view: starting from a uniform \(\mathrm{Beta}(1,1)\) prior, \(\alpha-1\) prior successes and \(\beta-1\) prior failures lead to \(\mathrm{Beta}(\alpha,\beta)\).

Shape changes#

  • Fix \(p=\alpha/(\alpha+\beta)\) and increase \(s\): PMF tightens around \(np\) (approaches binomial).

  • Fix \(s\) and vary \(p\): the mass shifts left/right.

  • If \(\alpha,\beta<1\) the Beta prior is U-shaped, so \(p\) often lands near 0 or 1 → PMF puts more mass near 0 and \(n\).
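The U-shaped case is easy to see numerically; with illustrative \(\alpha=\beta=0.5\), the endpoint probabilities dominate:

# U-shaped Beta prior (a, b < 1): p lands near 0 or 1, so the PMF is bimodal at 0 and n
n_u = 20
pmf_u = stats.betabinom.pmf(np.arange(n_u + 1), n_u, 0.5, 0.5)
pmf_u[[0, 1, n_u - 1, n_u]]  # mass at and near the endpoints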

# Effect of concentration s = a+b at fixed mean p
n = 20
p = 0.30
concentrations = [2, 5, 20, 200]

ks = np.arange(n + 1)
frames = []
for s in concentrations:
    a = p * s
    b = (1 - p) * s
    pmf = stats.betabinom.pmf(ks, n, a, b)
    frames.append(
        pd.DataFrame(
            {
                "k": ks,
                "pmf": pmf,
                "concentration (a+b)": f"{s:g}",
            }
        )
    )

df = pd.concat(frames, ignore_index=True)
fig = px.line(
    df,
    x="k",
    y="pmf",
    color="concentration (a+b)",
    markers=True,
    title="Beta-binomial PMF: increasing concentration approaches a Binomial",
    labels={"k": "successes k", "pmf": "P(X=k)"},
)
fig.show()
# Effect of changing mean p at fixed concentration s
n = 20
s = 10
ps = [0.1, 0.3, 0.5, 0.7]

ks = np.arange(n + 1)
frames = []
for p in ps:
    a = p * s
    b = (1 - p) * s
    pmf = stats.betabinom.pmf(ks, n, a, b)
    frames.append(pd.DataFrame({"k": ks, "pmf": pmf, "p": f"{p:.1f}"}))

df = pd.concat(frames, ignore_index=True)
fig = px.line(
    df,
    x="k",
    y="pmf",
    color="p",
    markers=True,
    title="Beta-binomial PMF: varying mean p shifts the mass",
    labels={"k": "successes k", "pmf": "P(X=k)", "p": "mean p"},
)
fig.show()

6) Derivations#

Expectation#

Use the law of total expectation:

\[ \mathbb{E}[X] = \mathbb{E}\big[\,\mathbb{E}[X\mid p]\,\big] = \mathbb{E}[n p] = n\,\mathbb{E}[p] = n\,\frac{\alpha}{\alpha+\beta}. \]

Variance#

Use the law of total variance:

\[ \mathrm{Var}(X) = \mathbb{E}[\mathrm{Var}(X\mid p)] + \mathrm{Var}(\mathbb{E}[X\mid p]). \]

For a binomial conditional on \(p\):

  • \(\mathbb{E}[X\mid p]=np\)

  • \(\mathrm{Var}(X\mid p)=np(1-p)\)

So:

\[ \mathbb{E}[\mathrm{Var}(X\mid p)] = n\,\mathbb{E}[p(1-p)] \qquad \mathrm{Var}(\mathbb{E}[X\mid p]) = n^2\,\mathrm{Var}(p). \]

Using Beta moments:

\[ \mathbb{E}[p] = \frac{\alpha}{s}, \qquad \mathrm{Var}(p) = \frac{\alpha\beta}{s^2(s+1)}. \]

After algebra, this yields:

\[ \mathrm{Var}(X) = n\,\frac{\alpha\beta}{s^2}\,\frac{n+s}{s+1}. \]

Likelihood#

For a single observation \(k\) (with fixed \(n\)), the likelihood is just the PMF viewed as a function of \((\alpha,\beta)\):

\[ L(\alpha,\beta\mid k) = \binom{n}{k}\,\frac{B(k+\alpha, n-k+\beta)}{B(\alpha,\beta)}. \]

For i.i.d. observations \(k_1,\dots,k_m\) (same \(n\)), the log-likelihood is:

\[ \ell(\alpha,\beta) = \sum_{i=1}^m\Big[ \log\binom{n}{k_i} + \log B(k_i+\alpha, n-k_i+\beta) - \log B(\alpha,\beta) \Big]. \]

In practice we compute this in log-space (e.g., using scipy.special.betaln).
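For example (the data values below are illustrative), the betabinom_logpmf helper defined earlier reproduces SciPy's logpmf-based log-likelihood:

# Log-likelihood in log-space: betaln-based helper vs SciPy
k_obs = np.array([3, 5, 2, 7, 4])  # illustrative observations with n = 12
n_ll, a_ll, b_ll = 12, 2.0, 3.0
ll_manual = float(np.sum(betabinom_logpmf(k_obs, n=n_ll, a=a_ll, b=b_ll)))
ll_scipy = float(np.sum(stats.betabinom.logpmf(k_obs, n_ll, a_ll, b_ll)))
ll_manual, ll_scipy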

# Monte Carlo check of mean/variance
n, a, b = 30, 2.0, 5.0
x = rvs_betabinom_numpy(n=n, a=a, b=b, size=200_000, rng=rng)

mean_mc = float(x.mean())
var_mc = float(x.var(ddof=0))

mean_cf, var_cf, *_ = betabinom_stats_closed_form(n, a, b)
pd.DataFrame(
    {
        "monte_carlo": [mean_mc, var_mc],
        "closed_form": [mean_cf, var_cf],
    },
    index=["mean", "variance"],
)
monte_carlo closed_form
mean 8.572700 8.571429
variance 28.446845 28.316327

7) Sampling & simulation#

The beta-binomial sampler is a direct implementation of the hierarchical model:

  1. Sample a probability \(p \sim \mathrm{Beta}(\alpha,\beta)\)

  2. Sample a count \(X \sim \mathrm{Binomial}(n,p)\)

To keep the sampler NumPy-only we avoid scipy.stats; NumPy’s Generator could draw \(p\) directly with rng.beta, but we use the classical Gamma-ratio construction to make the mechanism explicit:

\[ G_1 \sim \mathrm{Gamma}(\alpha,1),\quad G_2 \sim \mathrm{Gamma}(\beta,1),\quad p = \frac{G_1}{G_1+G_2} \sim \mathrm{Beta}(\alpha,\beta). \]

Then we sample \(X\sim\mathrm{Binomial}(n,p)\).

n, a, b = 20, 1.5, 3.0
samples = rvs_betabinom_numpy(n=n, a=a, b=b, size=20_000, rng=rng)
samples[:10], samples.mean(), samples.var(ddof=0)
(array([ 2, 11, 11,  7,  0,  4,  5,  5,  7, 12]), 6.65555, 19.7927041975)

8) Visualization#

We’ll plot:

  • the PMF on \(\{0,\dots,n\}\)

  • the CDF as a cumulative sum

  • a Monte Carlo approximation (histogram of samples)

n, a, b = 20, 1.5, 3.0
ks = np.arange(n + 1)

pmf = stats.betabinom.pmf(ks, n, a, b)
cdf = stats.betabinom.cdf(ks, n, a, b)

fig = make_subplots(rows=1, cols=2, subplot_titles=["PMF", "CDF"])

fig.add_trace(go.Bar(x=ks, y=pmf, name="PMF"), row=1, col=1)
fig.add_trace(go.Scatter(x=ks, y=cdf, mode="lines+markers", name="CDF"), row=1, col=2)

fig.update_xaxes(title_text="k (successes)", row=1, col=1)
fig.update_xaxes(title_text="k (successes)", row=1, col=2)
fig.update_yaxes(title_text="P(X=k)", row=1, col=1)
fig.update_yaxes(title_text="P(X\u2264k)", row=1, col=2)
fig.update_layout(title=f"Beta-binomial(n={n}, a={a}, b={b})", showlegend=False)

fig.show()
# Monte Carlo histogram vs PMF
samples = rvs_betabinom_numpy(n=n, a=a, b=b, size=50_000, rng=rng)
counts = np.bincount(samples, minlength=n + 1)
pmf_mc = counts / counts.sum()

fig = go.Figure()
fig.add_trace(go.Bar(x=ks, y=pmf_mc, name="Monte Carlo", opacity=0.6))
fig.add_trace(go.Scatter(x=ks, y=pmf, mode="lines+markers", name="True PMF"))
fig.update_layout(
    title="Beta-binomial: empirical frequency vs true PMF",
    xaxis_title="k (successes)",
    yaxis_title="probability",
)
fig.show()

9) SciPy integration (scipy.stats.betabinom)#

SciPy provides:

  • pmf, logpmf

  • cdf

  • rvs

  • stats (mean/var/skew/kurt)

About .fit#

As of SciPy 1.15, scipy.stats.betabinom (a discrete distribution) does not expose a built-in .fit method.

In practice you can fit \((\alpha,\beta)\) by maximizing the log-likelihood with scipy.optimize.

Below is a minimal but usable MLE routine that assumes known \(n\).

n, a, b = 20, 1.5, 3.0
ks = np.arange(n + 1)

pmf_sp = stats.betabinom.pmf(ks, n, a, b)
cdf_sp = stats.betabinom.cdf(ks, n, a, b)
x_sp = stats.betabinom.rvs(n, a, b, size=10, random_state=rng)

pmf_sp[:5], cdf_sp[:5], x_sp
(array([0.051885, 0.070752, 0.080018, 0.084018, 0.084571]),
 array([0.051885, 0.122637, 0.202655, 0.286674, 0.371245]),
 array([1, 0, 1, 4, 2, 8, 6, 6, 8, 6]))
def betabinom_moments_init(data: np.ndarray, n: int):
    """Method-of-moments style initializer for (a, b).

    If empirical variance <= binomial variance, we return a large concentration
    (close to a binomial model).
    """
    data = np.asarray(data)
    mu = float(data.mean())
    v = float(data.var(ddof=0))

    p = np.clip(mu / n, 1e-6, 1 - 1e-6)
    v_bin = n * p * (1 - p)

    if v <= v_bin * (1 + 1e-6):
        s = 1e6
    else:
        r = v / v_bin
        s = (n - r) / (r - 1)
        s = max(s, 1e-3)

    a0 = p * s
    b0 = (1 - p) * s
    return float(a0), float(b0)


def fit_betabinom_mle(data: np.ndarray, n: int):
    """Fit (a,b) by MLE with fixed n.

    Uses an unconstrained parameterization a=exp(theta0), b=exp(theta1).
    """
    data = np.asarray(data)
    if data.ndim != 1:
        raise ValueError("data must be 1D")
    if np.any((data < 0) | (data > n)):
        raise ValueError("data must be in {0,...,n}")
    if not np.issubdtype(data.dtype, np.integer):
        # allow float arrays with integer values
        if not np.all(np.isclose(data, np.round(data))):
            raise ValueError("data must be integer-valued")
        data = np.round(data).astype(int)

    a0, b0 = betabinom_moments_init(data, n=n)
    x0 = np.log([a0, b0])

    def nll(theta):
        a = float(np.exp(theta[0]))
        b = float(np.exp(theta[1]))
        return -float(np.sum(stats.betabinom.logpmf(data, n, a, b)))

    res = optimize.minimize(nll, x0=x0, method="L-BFGS-B")
    a_hat, b_hat = np.exp(res.x)
    return {
        "a_hat": float(a_hat),
        "b_hat": float(b_hat),
        "success": bool(res.success),
        "nll": float(res.fun),
        "message": res.message,
    }
# Demo: recover parameters from synthetic data
rng_fit = np.random.default_rng(123)

n_true, a_true, b_true = 25, 2.0, 5.0
data = stats.betabinom.rvs(n_true, a_true, b_true, size=2_000, random_state=rng_fit)

fit = fit_betabinom_mle(data, n=n_true)
fit
{'a_hat': 2.041290796966522,
 'b_hat': 5.036609641147764,
 'success': True,
 'nll': 5706.819056152122,
 'message': 'CONVERGENCE: RELATIVE REDUCTION OF F <= FACTR*EPSMCH'}

10) Statistical use cases#

Hypothesis testing (binomial vs overdispersed alternative)#

If you have repeated groups with the same \(n\) (e.g., 100 clinics each with \(n\) patients), a binomial model assumes all groups share a single \(p\).

Beta-binomial is a common alternative when the observed variability across groups is too large.

Practical approaches:

  • Model comparison: compare binomial vs beta-binomial by log-likelihood, AIC/BIC, or held-out predictive performance.

  • Overdispersion testing: parameterize by mean \(p\) and intraclass correlation \(\rho=1/(\alpha+\beta+1)\). The null is \(\rho=0\) (binomial), which lies on the boundary of the parameter space, so classical LRT chi-square theory needs care; a parametric bootstrap is often used (a minimal sketch follows this list).
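Here is a minimal parametric-bootstrap sketch (the sample sizes, parameter values, and bootstrap count B are illustrative; it reuses fit_betabinom_mle from section 9):

# Parametric bootstrap LRT: binomial null (rho = 0) vs beta-binomial alternative
def lrt_stat(data: np.ndarray, n: int) -> float:
    p_hat = float(data.mean() / n)
    ll0 = float(np.sum(stats.binom.logpmf(data, n, p_hat)))  # null: single shared p
    ll1 = -fit_betabinom_mle(data, n=n)["nll"]               # alternative: beta-binomial
    return 2.0 * (ll1 - ll0)

rng_boot = np.random.default_rng(7)
n_bt = 20
data_obs = stats.betabinom.rvs(n_bt, 1.5, 3.0, size=200, random_state=rng_boot)
t_obs = lrt_stat(data_obs, n=n_bt)

# Reference distribution: recompute the statistic on data simulated under the null
p0 = float(data_obs.mean() / n_bt)
B = 200  # kept small for speed; use more in practice
t_null = np.array(
    [lrt_stat(rng_boot.binomial(n_bt, p0, size=len(data_obs)), n=n_bt) for _ in range(B)]
)
p_value = float(np.mean(t_null >= t_obs))
t_obs, p_value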

Bayesian modeling#

With prior \(p\sim\mathrm{Beta}(\alpha_0,\beta_0)\) and data \(x\) successes out of \(n\) trials:

  • Posterior: \(p\mid x \sim \mathrm{Beta}(\alpha_0+x,\;\beta_0+n-x)\)

  • Posterior predictive for \(m\) future trials: \( Y\mid x \sim \mathrm{BetaBinomial}(m,\;\alpha_0+x,\;\beta_0+n-x) \)

Generative modeling#

Beta-binomial is a compact generative model for count data with bounded support when you expect latent heterogeneity in probabilities.

Examples:

  • modeling per-user conversions out of impressions

  • modeling per-batch defects out of inspected items

  • modeling per-class accuracy counts out of attempts (with varying difficulty)

# Model comparison example: Binomial vs Beta-binomial on synthetic overdispersed data
rng_use = np.random.default_rng(999)

n = 30
a_true, b_true = 1.2, 3.5
data = stats.betabinom.rvs(n, a_true, b_true, size=500, random_state=rng_use)

# Binomial MLE: p_hat = mean/n
p_hat = float(data.mean() / n)
ll_binom = float(np.sum(stats.binom.logpmf(data, n, p_hat)))
aic_binom = 2 * 1 - 2 * ll_binom

# Beta-binomial MLE (a,b), fixed n
fit = fit_betabinom_mle(data, n=n)
ll_bb = -fit["nll"]
aic_bb = 2 * 2 - 2 * ll_bb

pd.DataFrame(
    {
        "model": ["binomial", "beta-binomial"],
        "log_likelihood": [ll_binom, ll_bb],
        "AIC": [aic_binom, aic_bb],
    }
)
model log_likelihood AIC
0 binomial -2510.347071 5022.694142
1 beta-binomial -1527.653209 3059.306418
# Bayesian posterior predictive: Beta prior + Binomial likelihood -> Beta posterior -> Beta-binomial predictive
rng_bayes = np.random.default_rng(2024)

# Prior on p
a0, b0 = 2.0, 2.0

# Observed data: x successes out of n_obs
n_obs, x_obs = 20, 8
a_post, b_post = a0 + x_obs, b0 + (n_obs - x_obs)

# Predict m future trials
m = 15
ks = np.arange(m + 1)
pmf_pred = stats.betabinom.pmf(ks, m, a_post, b_post)

# Monte Carlo posterior predictive simulation for comparison
p_samp = rng_bayes.beta(a_post, b_post, size=100_000)
y_samp = rng_bayes.binomial(m, p_samp)
pmf_mc = np.bincount(y_samp, minlength=m + 1) / len(y_samp)

fig = go.Figure()
fig.add_trace(go.Bar(x=ks, y=pmf_mc, name="posterior predictive (MC)", opacity=0.6))
fig.add_trace(go.Scatter(x=ks, y=pmf_pred, mode="lines+markers", name="beta-binomial closed form"))
fig.update_layout(
    title="Posterior predictive: Beta-Binomial matches Monte Carlo",
    xaxis_title="future successes k",
    yaxis_title="probability",
)
fig.show()

11) Pitfalls#

Invalid parameters#

  • n must be a nonnegative integer.

  • a and b (i.e., \(\alpha,\beta\)) must be strictly positive.

SciPy will typically return nan for invalid parameters.

Numerical issues#

  • Naively computing \(\binom{n}{k}\) and \(B(\cdot,\cdot)\) can overflow/underflow for moderate/large \(n\).

  • Prefer logpmf / betaln / gammaln and only exponentiate at the end.

  • When comparing very small probabilities, compare log probabilities.

Estimation issues#

  • If data are close to binomial (little overdispersion), the MLE tends to push \(\alpha+\beta \to \infty\). Numerically, this can lead to very large parameter estimates; using a mean/dispersion parameterization \((p,\rho)\) can be more stable (a minimal mapping sketch follows).
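A minimal sketch of the \((p,\rho)\) mapping (the function name is illustrative; exercise 3 below asks you to fit in this space):

# Mean/dispersion parameterization: p = a/(a+b), rho = 1/(a+b+1)
def ab_from_p_rho(p: float, rho: float) -> tuple[float, float]:
    s = (1.0 - rho) / rho  # concentration a + b
    return p * s, (1.0 - p) * s

a_pr, b_pr = ab_from_p_rho(p=0.3, rho=0.1)
a_pr, b_pr, 1.0 / (a_pr + b_pr + 1.0)  # recovers rho = 0.1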

# Invalid parameters in SciPy
ks = np.arange(6)
stats.betabinom.pmf(ks, n=5.5, a=1.0, b=2.0)  # invalid n (not integer)
array([nan, nan, nan, nan, nan, nan])
# Log-space check for larger n: exp(logpmf) matches pmf
# (naively evaluating Gamma(n+1) etc. would overflow at this n)
n, a, b = 500, 2.0, 5.0
k = 250

pmf_direct = stats.betabinom.pmf(k, n, a, b)
logpmf = stats.betabinom.logpmf(k, n, a, b)
pmf_from_log = float(np.exp(logpmf))

pmf_direct, logpmf, pmf_from_log
(0.001878631572159702, -6.277211654327845, 0.001878631572159702)

12) Summary#

  • betabinom is a discrete distribution on \(\{0,\dots,n\}\).

  • It arises as a Beta–Binomial mixture: \(p\sim\mathrm{Beta}(\alpha,\beta)\) then \(X\mid p\sim\mathrm{Binomial}(n,p)\).

  • Mean and variance: \(\mathbb{E}[X]=n\frac{\alpha}{\alpha+\beta},\qquad \mathrm{Var}(X)=n\frac{\alpha\beta}{(\alpha+\beta)^2}\,\frac{n+\alpha+\beta}{\alpha+\beta+1}\).

  • Compared to a binomial with the same mean, beta-binomial allows overdispersion (extra variability) driven by heterogeneity in \(p\).

  • SciPy provides PMF/CDF/RVS; for fitting \((\alpha,\beta)\) you can maximize the log-likelihood with scipy.optimize.

Suggested exercises#

  1. Fix \(n\) and \(p\) and vary \(s=\alpha+\beta\): quantify how quickly the beta-binomial approaches a Binomial.

  2. Derive factorial moments \(\mathbb{E}[(X)_r]\) using the hierarchical model.

  3. Reparameterize by \((p,\rho)\) with \(\rho=1/(\alpha+\beta+1)\) and fit in that space.